## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
The dataset has 1599 observations, each with 12 variables.
Plot histograms of all variables.
Plot the histogram of quality.
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
Most wines (82%) have a quality score of 5 or 6.
Plot the histogram of fixed.acidity.
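The 82% figure can be recomputed from the frequency table above. This is a minimal sketch that uses only the counts printed by `table()`; the names `quality_counts` and `share_5_6` are introduced here for illustration.

```r
# Counts per quality score, copied from the table() output above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Share of wines scored 5 or 6.
share_5_6 <- sum(quality_counts[c("5", "6")]) / sum(quality_counts)
print(round(100 * share_5_6, 1))  # 82.5
```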
##
## 7.2 7.1 7.8 7.5 7 7.7 6.8 7.6 8.2 7.3 7.4 7.9 8 8.3 6.9
## 67 57 53 52 50 49 46 46 45 44 44 42 42 40 38
## 6.6 8.8 8.9 9.1 6.7 8.6 8.1 8.4 9 9.9 6.4 8.7 10 9.3 10.4
## 37 34 33 29 28 27 26 26 26 26 25 24 23 22 21
## 6.2 8.5 10.2 6.5 9.4 9.6 6.1 9.2 9.8 5.6 6.3 9.5 10.6 6 11.5
## 20 19 19 17 17 17 16 16 15 14 14 14 14 13 13
## 10.5 11.6 11.9 10.3 10.1 10.7 10.8 5.9 9.7 11.1 10.9 11.3 12 12.5 5
## 12 12 12 11 10 10 10 9 9 9 8 7 7 7 6
## 5.2 5.4 11.2 11.4 12.3 12.8 5.1 5.3 5.8 12.2 12.4 12.6 12.7 11 11.7
## 6 5 5 5 5 5 4 4 4 4 4 4 4 3 3
## 11.8 13 13.2 13.3 5.7 12.9 13.7 15 15.5 15.6 4.6 4.7 4.9 5.5 12.1
## 3 3 3 3 2 2 2 2 2 2 1 1 1 1 1
## 13.4 13.5 13.8 14 14.3 15.9
## 1 1 1 1 1 1
fixed.acidity peaks at 7.2, with a few outliers near 16.
Plot the histogram of volatile.acidity.
##
## 0.6 0.5 0.43 0.59 0.36 0.58 0.4 0.38 0.39 0.49 0.56 0.41
## 47 46 43 39 38 38 37 35 35 35 34 33
## 0.52 0.42 0.46 0.54 0.31 0.34 0.53 0.63 0.57 0.61 0.64 0.66
## 33 31 31 31 30 30 29 29 28 27 27 26
## 0.37 0.48 0.51 0.62 0.28 0.32 0.44 0.67 0.69 0.35 0.45 0.47
## 24 24 24 24 23 23 23 23 23 22 22 21
## 0.33 0.55 0.26 0.29 0.3 0.65 0.27 0.24 0.645 0.68 0.715 0.685
## 20 20 16 16 16 16 14 13 12 12 12 11
## 0.74 0.18 0.7 0.78 0.635 0.725 0.735 0.785 0.84 0.25 0.655 0.695
## 11 10 10 10 9 9 8 8 8 7 7 7
## 0.21 0.22 0.615 0.705 0.73 0.75 0.77 0.23 0.545 0.72 0.745 0.76
## 6 6 6 6 6 6 6 5 5 5 5 5
## 0.765 0.82 0.88 0.885 0.775 0.83 0.835 0.87 0.915 1.02 0.12 0.2
## 5 5 5 5 4 4 4 4 4 4 3 3
## 0.415 0.575 0.585 0.605 0.625 0.665 0.675 0.71 0.755 0.8 0.815 0.855
## 3 3 3 3 3 3 3 3 3 3 3 3
## 0.9 0.91 0.96 0.965 0.98 1 1.04 0.16 0.19 0.305 0.315 0.365
## 3 3 3 3 3 3 3 2 2 2 2 2
## 0.395 0.475 0.79 0.795 0.81 0.85 0.86 0.875 0.935 1.33 0.295 0.565
## 2 2 2 2 2 2 2 2 2 2 1 1
## 0.595 0.805 0.825 0.845 0.865 0.89 0.895 0.92 0.95 0.955 0.975 1.005
## 1 1 1 1 1 1 1 1 1 1 1 1
## 1.01 1.025 1.035 1.07 1.09 1.115 1.13 1.18 1.185 1.24 1.58
## 1 1 1 1 1 1 1 1 1 1 1
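volatile.acidity peaks at 0.6, with outliers near 1.6.
Remove the 1% of values treated as outliers and plot the histogram again.
The result is a roughly symmetric, bimodal histogram.
Plot the histogram of citric.acid.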
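The report does not show how the 1% of outliers was removed. One common approach, sketched here as an assumption, is to cut the data at the 99th percentile. The helper `trim_top` and the vector `demo` are hypothetical and the demo values are made up.

```r
# Hypothetical helper (not from the original analysis): keep the values at or
# below the p-th quantile, dropping the upper tail.
trim_top <- function(x, p = 0.99) {
  x[x <= quantile(x, probs = p, na.rm = TRUE)]
}

# Made-up illustration data: evenly spaced values plus one extreme point.
demo <- c(seq(0.20, 0.80, by = 0.01), 1.58)
trimmed <- trim_top(demo)
print(length(demo) - length(trimmed))  # number of values dropped
```

On the wine data the call would be `trim_top(wine$volatile.acidity)`; the trimmed vector can then be passed to the histogram.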
##
## 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14
## 132 33 50 30 29 20 24 22 33 30 35 15 27 18 21
## 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29
## 19 9 16 22 21 25 33 27 25 51 27 38 20 19 21
## 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44
## 30 30 32 25 24 13 20 19 14 28 29 16 29 15 23
## 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59
## 22 19 18 23 68 20 13 17 14 13 12 8 9 9 8
## 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74
## 9 2 1 10 9 7 14 2 11 4 2 1 1 3 4
## 0.75 0.76 0.78 0.79 1
## 1 3 1 1 1
There are 132 zero values and a single outlier at 1. The distribution is multimodal.
Plot the histogram of residual.sugar.
##
## 2 2.2 1.8 2.1 1.9 2.3 2.4 2.5 2.6 1.7 1.6 2.8 2.7 1.4 1.5
## 156 131 129 128 117 109 86 84 79 76 58 49 39 35 30
## 3 2.9 3.2 3.4 3.3 4 1.2 3.6 3.8 4.3 5.5 3.1 3.9 4.1 4.6
## 25 24 15 15 11 11 8 8 8 8 8 7 6 6 6
## 5.6 1.3 4.2 5.1 3.7 4.4 4.5 5.8 6 6.1 4.8 5.2 5.9 6.2 6.4
## 6 5 5 5 4 4 4 4 4 4 3 3 3 3 3
## 7.9 8.3 0.9 1.65 1.75 2.05 2.15 3.5 4.65 6.3 6.55 6.6 6.7 7.8 8.1
## 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2
## 8.8 11 13.8 15.4 2.25 2.35 2.55 2.65 2.85 2.95 3.45 3.65 3.75 4.25 4.7
## 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1
## 5 5.15 5.4 5.7 7 7.2 7.3 7.5 8.6 8.9 9 10.7 12.9 13.4 13.9
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 15.5
## 1
The peak is at 2, with a long right tail.
Apply a log transform to residual.sugar and plot the histogram again.
Plot the histogram of chlorides.
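To illustrate why a log transform helps with a long right tail, the sketch below compares the sample skewness of simulated right-skewed data before and after `log10()`. The data are made up (drawn with `rlnorm`), not the wine measurements, and the helper `skewness` is defined here for illustration only.

```r
set.seed(42)  # reproducible made-up data

# Simulated right-skewed values; NOT the wine data.
x <- rlnorm(1000, meanlog = log(2.2), sdlog = 0.4)

# Sample skewness: third central moment over the cubed standard deviation.
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skewness(x)
skew_log <- skewness(log10(x))
print(c(raw = skew_raw, log10 = skew_log))
```

In a ggplot2 histogram the same transformation can be applied to the axis with `scale_x_log10()`.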
##
## 0.08 0.074 0.076 0.078 0.084 0.071 0.077 0.082 0.075 0.079 0.081 0.07
## 66 55 51 51 49 47 47 46 45 43 40 35
## 0.073 0.083 0.066 0.088 0.086 0.068 0.067 0.085 0.087 0.089 0.062 0.072
## 35 35 32 32 31 30 27 25 25 25 24 24
## 0.065 0.095 0.063 0.092 0.069 0.09 0.093 0.064 0.091 0.094 0.096 0.097
## 23 23 22 22 21 21 21 20 19 19 18 18
## 0.059 0.06 0.104 0.058 0.054 0.1 0.05 0.098 0.061 0.114 0.052 0.057
## 17 16 16 14 13 13 12 12 11 11 10 10
## 0.102 0.056 0.107 0.048 0.049 0.055 0.099 0.106 0.11 0.118 0.103 0.111
## 10 9 9 8 8 8 8 8 8 8 7 7
## 0.122 0.105 0.112 0.123 0.044 0.053 0.101 0.115 0.039 0.041 0.045 0.046
## 7 6 6 6 5 5 5 5 4 4 4 4
## 0.047 0.117 0.132 0.042 0.109 0.119 0.12 0.124 0.157 0.166 0.214 0.415
## 4 4 4 3 3 3 3 3 3 3 3 3
## 0.012 0.038 0.116 0.121 0.152 0.171 0.178 0.205 0.226 0.414 0.034 0.043
## 2 2 2 2 2 2 2 2 2 2 1 1
## 0.051 0.108 0.113 0.125 0.126 0.127 0.128 0.136 0.137 0.143 0.145 0.146
## 1 1 1 1 1 1 1 1 1 1 1 1
## 0.147 0.148 0.153 0.159 0.161 0.165 0.168 0.169 0.17 0.172 0.174 0.176
## 1 1 1 1 1 1 1 1 1 1 1 1
## 0.186 0.19 0.194 0.2 0.213 0.216 0.222 0.23 0.235 0.236 0.241 0.243
## 1 1 1 1 1 1 1 1 1 1 1 1
## 0.25 0.263 0.267 0.27 0.332 0.337 0.341 0.343 0.358 0.36 0.368 0.369
## 1 1 1 1 1 1 1 1 1 1 1 1
## 0.387 0.401 0.403 0.413 0.422 0.464 0.467 0.61 0.611
## 1 1 1 1 1 1 1 1 1
The peak is at 0.08, with a long right tail.
Apply a log transform to chlorides and plot the histogram again.
Plot the histogram of free.sulfur.dioxide.
##
## 6 5 10 15 12 7
## 138 104 79 78 75 71
free.sulfur.dioxide peaks at 6, with a long tail and some outliers.
Plot the histogram of total.sulfur.dioxide.
##
## 28 24 15 18 23 14
## 43 36 35 35 34 33
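total.sulfur.dioxide peaks at 28, with a long tail and some outliers. Its distribution resembles that of free.sulfur.dioxide, so I suspect the two variables are correlated.
Plot the histogram of density.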
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
Approximately normal, with median 0.9968 and mean 0.9967.
Plot the histogram of pH.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
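Approximately normal, with median 3.310 and mean 3.311.
Plot the histogram of sulphates. It has a long tail and some outliers; a log transform makes it approximately normal, with the peak near 0.6.
Plot the histogram of alcohol.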
##
## 9.5 9.4 9.8 9.2 10 10.5
## 139 103 78 72 67 67
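The peak is at 9.5, and the shape of the histogram resembles those of total.sulfur.dioxide and free.sulfur.dioxide.
The dataset has 1599 observations of 12 variables. quality is the rating variable; the scores observed in this sample range from 3 to 8.
1. citric.acid contains a large number of zero values.
2. density and pH are approximately normally distributed.
3. residual.sugar, chlorides and sulphates have long right tails.
4. Most wines (82%) have a quality score of 5 or 6.
The main variable of interest is quality: which factors influence it?
quality correlates most strongly with alcohol, volatile acidity, sulphates and citric acid.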
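The report does not show how these correlations were obtained. A minimal sketch of one way to rank variables by their correlation with quality follows. It uses only the first ten rows printed by `str()` above, so the coefficients it produces will not match those from the full 1599-row data; the names `wine_head`, `predictors` and `cors` are introduced here for illustration.

```r
# First ten rows of five variables, copied from the str() output above.
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation of each predictor with quality.
predictors <- setdiff(names(wine_head), "quality")
cors <- sapply(wine_head[predictors], function(v) cor(v, wine_head$quality))

# Rank predictors by the strength of the correlation, ignoring its sign.
print(sort(abs(cors), decreasing = TRUE))
```

On the full data, `wine_head` would be replaced by `wine`, with quality kept numeric for this step.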
# quality/alcohol boxplot
qplot(x = quality, y = alcohol, data = wine, geom = 'boxplot')
# Summary of alcohol by quality
by(wine$alcohol, wine$quality, summary)
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.725 9.925 9.955 10.580 11.000
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.50 11.47 12.10 14.00
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.80 11.32 12.15 12.09 12.88 14.00
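Wines of higher quality tend to have higher alcohol. Apart from quality 5, the median alcohol rises with each quality score, and the quality-5 group has many outliers. I suspect this may be due to errors in the sample.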
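The claim that quality 5 is the only exception to the rising trend can be checked directly from the medians in the `by()` output above. A small sketch, with the variable names introduced here for illustration:

```r
# Median alcohol per quality score, copied from the by() output above.
alcohol_median <- c("3" = 9.925, "4" = 10.00, "5" = 9.70, "6" = 10.50,
                    "7" = 11.50, "8" = 12.15)

# Change in the median from each quality score to the next.
step <- diff(alcohol_median)
print(step)

# Quality scores whose median is lower than that of the previous score.
drops <- names(alcohol_median)[-1][step < 0]
print(drops)  # "5"
```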
# quality/volatile.acidity boxplot
qplot(x = quality, y = volatile.acidity, data = wine, geom = 'boxplot')
# Summary of volatile.acidity by quality
by(wine$volatile.acidity, wine$quality, summary)
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4400 0.6475 0.8450 0.8845 1.0100 1.5800
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.230 0.530 0.670 0.694 0.870 1.130
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.180 0.460 0.580 0.577 0.670 1.330
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.3800 0.4900 0.4975 0.6000 1.0400
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3000 0.3700 0.4039 0.4850 0.9150
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2600 0.3350 0.3700 0.4233 0.4725 0.8500
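volatile.acidity is negatively correlated with quality. As quality rises the median volatile.acidity falls, although there is little change between quality 7 and 8. Overall, good red wines have relatively low volatile.acidity.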
# quality/sulphates boxplot
qplot(x = quality, y = sulphates, data = wine, geom = 'boxplot')
# Summary of sulphates by quality
by(wine$sulphates, wine$quality, summary)
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 0.5125 0.5450 0.5700 0.6150 0.8600
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.4900 0.5600 0.5964 0.6000 2.0000
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.370 0.530 0.580 0.621 0.660 1.980
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 0.5800 0.6400 0.6753 0.7500 1.9500
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3900 0.6500 0.7400 0.7413 0.8300 1.3600
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6300 0.6900 0.7400 0.7678 0.8200 1.1000
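sulphates tends to rise with quality. However, the quality 5 and 6 groups contain many outliers, possibly due to errors in the sample, so we cannot claim that sulphates and quality are correlated, only that sulphates may influence the taste of the wine.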
# quality/citric.acid boxplot
qplot(x = quality, y = citric.acid, data = wine, geom = 'boxplot')
# Summary of citric.acid by quality
by(wine$citric.acid, wine$quality, summary)
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0050 0.0350 0.1710 0.3275 0.6600
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0300 0.0900 0.1742 0.2700 1.0000
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2300 0.2437 0.3600 0.7900
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2600 0.2738 0.4300 0.7800
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3050 0.4000 0.3752 0.4900 0.7600
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0300 0.3025 0.4200 0.3911 0.5300 0.7200
citric.acid rises with quality, so the two are positively correlated. Interestingly, the medians are close within each pair of adjacent scores: 3 and 4, 5 and 6, 7 and 8.
# boxplot for the others
qplot(x = quality, y = fixed.acidity, data = wine, geom = 'boxplot') +
ylim(quantile(wine$fixed.acidity, 0.05), quantile(wine$fixed.acidity, 0.95))
## Warning: Removed 149 rows containing non-finite values (stat_boxplot).
qplot(x = quality, y = residual.sugar, data = wine, geom = 'boxplot') +
ylim(0, quantile(wine$residual.sugar, 0.95))
## Warning: Removed 79 rows containing non-finite values (stat_boxplot).
qplot(x = quality, y = chlorides, data = wine, geom = 'boxplot') +
ylim(quantile(wine$chlorides, 0.05), quantile(wine$chlorides, 0.95))
## Warning: Removed 171 rows containing non-finite values (stat_boxplot).
qplot(x = quality, y = free.sulfur.dioxide, data = wine, geom = 'boxplot') +
ylim(0, quantile(wine$free.sulfur.dioxide, 0.95))
## Warning: Removed 77 rows containing non-finite values (stat_boxplot).
qplot(x = quality, y = total.sulfur.dioxide, data = wine, geom = 'boxplot') +
ylim(0, quantile(wine$total.sulfur.dioxide, 0.95))
## Warning: Removed 80 rows containing non-finite values (stat_boxplot).
qplot(x = quality, y = density, data = wine, geom = 'boxplot') +
ylim(quantile(wine$density, 0.05), quantile(wine$density, 0.95))
qplot(x = quality, y = pH, data = wine, geom = 'boxplot') +
ylim(quantile(wine$pH, 0.05), quantile(wine$pH, 0.95))
density, pH and fixed.acidity are also correlated with quality: wines of higher quality have higher fixed.acidity and lower density and pH.
The correlation matrix also shows correlations among the variables other than quality:
1. Fixed acidity vs citric acid (0.67)
2. Volatile acidity vs citric acid (-0.55)
3. Fixed acidity vs density (0.67)
4. Fixed acidity vs pH (-0.68)
5. Citric acid vs pH (-0.54)
6. Free sulfur dioxide vs total sulfur dioxide (0.67)
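Pairs like those listed above can be pulled out of a correlation matrix programmatically. The sketch below is one way to do it, not the method used in the report. The helper `strong_pairs` and its `threshold` argument are hypothetical, and the demo uses only the first ten rows printed by `str()`, so its output will differ from the coefficients computed on the full data.

```r
# Hypothetical helper: list variable pairs whose absolute Pearson correlation
# reaches a threshold, reading only the upper triangle so each pair appears once.
strong_pairs <- function(df, threshold = 0.5) {
  cm <- cor(df)
  idx <- which(abs(cm) >= threshold & upper.tri(cm), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, 1]],
    var2 = colnames(cm)[idx[, 2]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
}

# First ten rows of four variables, copied from the str() output above.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

pairs_found <- strong_pairs(wine_head, threshold = 0.5)
print(pairs_found)
```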
# scatterplot for citric acid and fixed acidity
ggplot(data = wine, aes(x = citric.acid, y = fixed.acidity)) +
geom_jitter(alpha=1/3, color = 'blue') +
geom_smooth(method='lm', color='red')
# scatterplot for citric acid and volatile acidity
ggplot(data = wine, aes(x = citric.acid, y = volatile.acidity)) +
geom_jitter(alpha=1/3, color = 'blue') +
geom_smooth(method='lm', color='red')
# scatterplot for fixed acidity and density
ggplot(data = wine, aes(x = fixed.acidity, y = density)) +
geom_jitter(alpha=1/3, color = 'blue') +
geom_smooth(method='lm', color='red')
# scatterplot for fixed acidity and pH
ggplot(data = wine, aes(x = fixed.acidity, y = pH)) +
geom_jitter(alpha=1/3, color = 'blue') +
geom_smooth(method='lm', color='red')
# scatterplot for citric acid and pH
ggplot(data = wine, aes(x = citric.acid, y = pH)) +
geom_jitter(alpha=1/3, color = 'blue') +
geom_smooth(method='lm', color='red')
# scatterplot for total and free sulfur dioxide
ggplot(data = wine, aes(x = total.sulfur.dioxide, y = free.sulfur.dioxide)) +
geom_jitter(alpha=1/3, color = 'blue') +
geom_smooth(method='lm', color='red')
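The scatterplots show a strong positive correlation between fixed acidity and citric acid: as one increases, so does the other. Volatile acidity and citric acid are negatively correlated: as one increases, the other decreases. density and fixed.acidity are strongly positively correlated.
pH is negatively correlated with both fixed acidity and citric acid, which matches what we know about acidity: more acid means lower pH.
Total sulfur dioxide and free sulfur dioxide are positively correlated because total sulfur dioxide includes free sulfur dioxide, so when one increases the other does too.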
# Plot the scatterplot for chlorides and sulphates
ggplot(data = wine, aes(x = chlorides, y = sulphates)) +
geom_jitter(alpha=1/3, color = 'blue') +
geom_smooth(method='lm', color='red')
# Plot the scatterplot for chlorides and sulphates
# which excludes the top 5% values
ggplot(data = wine, aes(x = chlorides, y = sulphates)) +
geom_jitter(alpha=1/3, color = 'blue') +
xlim(0, quantile(wine$chlorides, 0.95)) +
ylim(0, quantile(wine$sulphates, 0.95)) +
geom_smooth(method='lm', color='red')
## Warning: Removed 131 rows containing non-finite values (stat_smooth).
## Warning: Removed 134 rows containing missing values (geom_point).
# Find the correlation coefficient of chlorides/sulphates with top 5% removed
with(subset(wine, chlorides < quantile(wine$chlorides, 0.95) &
sulphates < quantile(wine$sulphates, 0.95)),
cor.test(chlorides, sulphates))
##
## Pearson's product-moment correlation
##
## data: chlorides and sulphates
## t = -2.0202, df = 1466, p-value = 0.04354
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.103573750 -0.001532528
## sample estimates:
## cor
## -0.05269068
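chlorides and sulphates do not appear to be truly correlated. Their correlation coefficient on the full data is ??, but after removing the top 5% of values it drops to -0.05.
Quality correlates with alcohol (0.48), volatile acidity (-0.39), sulphates (0.25) and citric acid (0.23); all of these are positive except volatile acidity.
Higher-quality wines have higher alcohol content.
Higher-quality wines have lower volatile acidity.
Quality and sulphates appear to be positively correlated, but there are many outliers at quality 5.
Low-quality wines (3, 4) contain very little citric acid; medium-quality wines (5, 6) contain about 0.25 g/dm^3; high-quality wines (7, 8) contain more than 0.25 g/dm^3.
Higher-quality wines have lower density and pH.
Wines with higher fixed acidity also have more citric acid, and higher citric acid goes with higher quality. Volatile acidity is negatively correlated with fixed acidity, and high volatile acidity is associated with lower quality.
Quality has its strongest correlation with alcohol; the boxplots show that quality tends to rise with alcohol.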
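The coefficient after trimming can be cross-checked from the `cor.test` output above, because for a Pearson test the statistic and the coefficient are linked by r = t / sqrt(t^2 + df). A short sketch using the printed `t` and `df`, with variable names introduced here for illustration:

```r
# Values copied from the cor.test() output above.
t_stat <- -2.0202
df     <- 1466

# Recover the correlation coefficient from the t statistic.
r <- t_stat / sqrt(t_stat^2 + df)
print(round(r, 4))  # -0.0527

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df)
print(p_value)
```

The interval excludes zero only narrowly and the coefficient is very small, which is consistent with reading the two variables as essentially uncorrelated once the extreme values are removed.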
# Plot the scatterplot of citric acid and volatile acidity, color by quality
ggplot(data=wine,aes(x=citric.acid, y=volatile.acidity, color=quality)) +
geom_point(alpha=1, position='jitter') +
scale_color_brewer(type='div')
# Plot the scatterplot of citric acid and volatile acidity, facet by quality
# Also add the smoothed conditional mean to the plots
ggplot(data=wine,aes(x=citric.acid, y=volatile.acidity, color=quality)) +
geom_point(alpha=0.5, position='jitter') +
geom_smooth(method='lm') +
facet_wrap(~quality) +
scale_color_brewer(type='div') +
scale_x_continuous(breaks=c(0,0.25,0.5,0.75)) +
theme(axis.text.x = element_text(size = 10),
axis.text.y = element_text(size = 10))
# Plot the scatterplot of citric acid and volatile acidity, color by quality
# Show the smoothed conditional means in the same plot
ggplot(aes(x=citric.acid, y=volatile.acidity, color = quality),
data = wine) +
geom_point(alpha=0.2, position = 'jitter') +
geom_smooth(method='lm', se=FALSE, size=1)
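The scatterplots above show how the relationship between citric acid and volatile acidity varies across quality scores. Within every quality score, citric acid and volatile acidity are negatively correlated. This shows two things:
1. Higher-quality wines have lower volatile acidity.
2. Within each quality score, citric acid and volatile acidity are negatively correlated.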
# Plot the boxplots of citric.acid/fixed.acidity by quality
qplot(x = quality, y = citric.acid/fixed.acidity, data = wine,
geom = 'boxplot')
# Plot the histogram of citric.acid/fixed.acidity, filled by quality
# geom_histogram bins the continuous ratio; the binwidth is an assumed value
ggplot(data = wine, aes(x=citric.acid/fixed.acidity)) +
geom_histogram(aes(fill=quality), binwidth=0.005)
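Splitting by quality makes the relationship between citric acid and volatile acidity clearer: within every quality score the two are negatively correlated. A linear model on citric acid and volatile acidity could be used to predict quality.
The ratio of citric acid to fixed acidity is a useful indicator of quality. For high-quality wines this ratio is close to 0.05.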
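The report suggests a linear model on citric acid and volatile acidity but does not fit one. The sketch below shows what such a fit could look like with `lm()`. It uses only the first ten rows printed by `str()`, so the coefficients are not meaningful estimates; the names `wine_head` and `fit` are introduced here for illustration.

```r
# First ten rows of three variables, copied from the str() output above.
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Ordinary least squares with quality treated as a numeric response.
fit <- lm(quality ~ citric.acid + volatile.acidity, data = wine_head)
print(coef(fit))
```

On the full data the same formula would be used with `data = wine`. If quality has been converted to a factor for plotting, it would need to be converted back to numeric first, for example with `as.numeric(as.character(wine$quality))`.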
# Plot the frequency polygon of citric acid
qplot(citric.acid, data = wine, color=I(color_fill), binwidth=0.01,
geom = 'freqpoly') +
ggtitle('Frequency Polygon of Citric Acid') +
xlab('Citric Acid (g / dm^3)') +
ylab('Number of Samples') +
theme(plot.title = element_text(size = 16))
Citric acid has a multimodal distribution, with three peaks at 0, 0.25 and 0.5. The sample contains a large number of zero values.
# Plot the scatterplot of citric acid and volatile acidity, color by quality
# Show the smoothed conditional means in the same plot
ggplot(data = wine, aes(x=citric.acid, y=volatile.acidity,
color = quality)) +
geom_point(alpha=0.7, position = 'jitter') +
geom_smooth(method='lm', se=FALSE, size=1) +
coord_cartesian(xlim = c(0, 0.8), ylim=c(0,1.25)) +
ggtitle('Citric Acid / Volatile Acidity by Quality') +
xlab('Citric Acid (g / dm^3)') +
ylab('Volatile Acidity (g / dm^3)') +
scale_color_discrete(name="Quality") +
theme(plot.title = element_text(size = 16))
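Higher-quality wines have more citric acid and less volatile acidity, and the two acids are negatively correlated. One possible explanation is that citric acid and volatile acidity convert into one another under certain conditions.
The plot shows that once volatile acidity exceeds 1, no wine is rated excellent. When volatile acidity is between 0 and 0.3, about 40% of wines are rated excellent. When it is between 1 and 1.2, about 80% of wines are rated bad, and above 1.4 every wine is rated bad. Volatile acidity is therefore a useful feature for flagging bad wines.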